During speech perception, humans integrate auditory information from the voice with
visual information from the face. This multisensory integration increases perceptual 
precision, but only if the two cues come from the same talker; this requirement has 
been largely ignored by current models of speech perception. We describe a generative 
model of multisensory speech perception that includes this critical step of determining 
the likelihood that the voice and face information have a common cause. A key feature 
of the model is that it is based on a principled analysis of how an observer should solve 
this causal inference problem using the asynchrony between two cues and the reliability 
of the cues. This allows the model to make predictions about the behavior of subjects 
performing a synchrony judgment task, predictive power that does not exist in other 
approaches, such as post-hoc fitting of Gaussian curves to behavioral data. We tested 
the model predictions against the performance of 37 subjects performing a synchrony 
judgment task viewing audiovisual speech under a variety of manipulations, including 
varying asynchronies, intelligibility, and visual cue reliability. The causal inference model 
outperformed the Gaussian model across two experiments, providing a better fit to the 
behavioral data with fewer parameters. Because the causal inference model is derived 
from a principled understanding of the task, model parameters are directly interpretable 
in terms of stimulus and subject properties.
INTRODUCTION

When an observer hears a voice and sees mouth movements, 
there are two potential causal structures (Figure 1A). In the first
causal structure, the events have a common cause (C = 1): a sin- 
gle talker produces the voice heard and the mouth movements 
seen. In the second causal structure, the events have two differ- 
ent causes (C = 2): one talker produces the auditory voice and 
a different talker produces the seen mouth movements. When 
there is a single talker, integrating the auditory and visual speech 
information increases perceptual accuracy (Sumby and Pollack,
1954; Rosenblum et al., 1996; Schwartz et al., 2004; Ma et al.,
2009); most computational work on audiovisual integration of
speech has focused on this condition (Massaro, 1989; Massaro
et al., 2001; Bejjanki et al., 2011). However, if there are two talkers,
integrating the auditory and visual information actually decreases
perceptual accuracy (Körding et al., 2007; Shams and Beierholm,
2010). Therefore, a critical step in audiovisual integration during
speech perception is estimating the likelihood that the speech
arises from a single talker. This process, known as causal inference
(Körding et al., 2007; Schutz and Kubovy, 2009; Shams and
Beierholm, 2010; Buehner, 2012), has provided an excellent tool
for understanding the behavioral properties of tasks requiring
spatial localization of simple auditory beeps and visual flashes
(Körding et al., 2007; Sato et al., 2007). However, multisensory
speech perception is a complex and highly specialized computation
that takes place in brain areas distinct from those that
perform audiovisual localization (Beauchamp et al., 2004). Unlike
spatial localization, in which subjects estimate the continuous
variable of location, speech perception is inherently multidimensional
(Ma et al., 2009) and requires categorical decision making
(Bejjanki et al., 2011). Therefore, we set out to determine whether
the causal inference model could explain the behavior of humans 
perceiving multisensory speech. 

Manipulating the asynchrony between the auditory and visual 
components of speech dramatically affects audiovisual integra- 
tion. Therefore, synchrony judgment tasks are widely used in the 
audiovisual speech literature and have been used to character- 
ize both individual differences in speech perception in healthy 
subjects and group differences between healthy and clinical pop- 
ulations (Lachs and Hernandez, 1998; Conrey and Pisoni, 2006; 
Smith and Bennetto, 2007; Rouger et al., 2008; Foss-Feig et al., 
2010; Navarra et al., 2010; Stevenson et al., 2010; Vroomen and
Keetels, 2010). While these behavioral studies provide valuable 
descriptions of behavior, the lack of a principled, quantitative 
foundation is a fundamental limitation. In these studies, syn- 
chrony judgment data are fit with a series of Gaussian curves 
without a principled justification for why synchrony data should 
be Gaussian in shape. A key advantage of the causal inference 
model is that the model parameters, such as the sensory noise in 
each perceiver, can be directly related to the neural mechanisms 
underlying speech perception. This stands in sharp contrast to the 
Gaussian model's explanations based solely on descriptive measures
(most often the standard deviation of the fitted curve). The
causal inference model generates behavioral predictions based on
an analysis of how the brain might best perform the task, rather
than seeking a best-fitting function for the behavioral data.

FIGURE 1 | Causal structure of audiovisual speech. (A) Causal diagram
for audiovisual speech emanating from a single talker (C = 1) or two talkers
(C = 2). (B) Difference between auditory and visual speech onsets showing
a narrow distribution for C = 1 (navy) and a broad distribution for C = 2
(gray). The first x-axis shows the onset difference in the reference frame of
physical asynchrony. The second x-axis shows the onset difference in the
reference frame of the stimulus (audio/visual offset created by shifting the
auditory speech relative to the visual speech). A recording of natural speech
without any manipulation corresponds to zero offset in the stimulus
reference frame and a positive offset in the physical asynchrony reference
frame because visual mouth opening precedes auditory voice onset. (C)
For any given physical asynchrony (Δ) there is a distribution of measured
asynchronies (with standard deviation σ) because of sensory noise. (D)
Combining the likelihood of each physical asynchrony (B) with sensory
noise (C) allows calculation of the measured asynchrony distributions
across all physical asynchronies. Between the dashed lines, the likelihood
of C = 1 is greater than the likelihood of C = 2.

CAUSAL INFERENCE OF ASYNCHRONOUS AUDIOVISUAL 
SPEECH 

The core of the causal inference model is a first-principles 
analysis of how the relationship between cues can be used to 
determine the likelihood of a single talker or multiple talk- 
ers. Natural auditory and visual speech emanating from the 
same talker (C = 1) contains a small delay between the visual 
onset and the auditory onset caused by the talker preparing the 
facial musculature for the upcoming vocalization before engaging
the vocal cords. This delay results in the distribution of asynchronies
having a positive mean when measured by the physical
difference between the auditory and visual stimulus onsets
(Figure 1B). When there are two talkers (C = 2), there is no
relationship between the visual and auditory onsets, resulting in a
broad distribution of physical asynchronies. Observers do not
have perfect knowledge of the physical asynchrony, but instead
their measurements are subject to sensory noise (Figure 1C).
An observer's measured asynchrony therefore follows a distribution
that is broader than the physical asynchrony distribution.
Overlaying these distributions shows that there is a window
of measured asynchronies for which C = 1 is more likely
than C = 2 (Figure 1D). This region is the Bayes-optimal synchrony
window and is used by the observer to make the
synchronous/asynchronous decision; this window does not change
based on the physical asynchrony, which is unknown to the 
observer. 
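
The overlay of the two measured-asynchrony distributions described above can be sketched numerically. The following Python snippet is an illustration only (the models in this paper were fit in R); the function names and all numeric values are hypothetical and chosen for readability, not taken from the fitted model. It scans a grid of measured asynchronies and keeps those for which C = 1 is more probable than C = 2, assuming Gaussian distributions whose variances add with the sensory noise.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def synchrony_window_grid(sigma, sigma1, sigma2, mu2, p_common=0.5,
                          lo=-1000, hi=1000, step=1):
    """Return (min, max) measured asynchrony in ms for which C = 1 is favored.

    Sensory noise (sigma) broadens both physical-asynchrony distributions,
    so each measured-asynchrony variance is the sum of the two variances.
    """
    sd_c1 = math.sqrt(sigma ** 2 + sigma1 ** 2)
    sd_c2 = math.sqrt(sigma ** 2 + sigma2 ** 2)
    inside = [
        x for x in range(lo, hi + 1, step)
        if p_common * normal_pdf(x, 0.0, sd_c1)
        > (1.0 - p_common) * normal_pdf(x, mu2, sd_c2)
    ]
    return (min(inside), max(inside)) if inside else None

# Hypothetical values in ms, stimulus reference frame (C = 1 mean is zero).
window = synchrony_window_grid(sigma=100, sigma1=60, sigma2=150, mu2=-50)
print(window)
```

With a negative C = 2 mean, the resulting window extends further on the visual-leading (positive) side than on the auditory-leading side, which is the asymmetry the text attributes to the model structure.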

During perception of a multisensory speech event, the mea- 
sured onsets for the auditory and visual cues are corrupted by 
sensory noise (Ma, 2012); these measurements are subtracted 
to produce the measured asynchrony (x; Figure 2A). Critically,
observers use this measured asynchrony, rather than the physi- 
cal asynchrony, to infer the causal structure of the speech event. 
Because the sensory noise has zero mean, the physical asynchrony 
determines the mean of the measured asynchrony distribution. 
Thus, synchrony perception depends both on the physical asyn- 
chrony and the observer's sensory noise. For example, if the visual 
cue leads the auditory cue by 100 ms (physical asynchrony = 
100 ms, the approximate delay for cues from the same talker), 
then the measured asynchrony is likely to fall within the Bayes- 
optimal synchrony window, and the observer is likely to respond 
synchronous (Figure 2B). By contrast, if the visual cue trails 
the auditory cue by 100 ms (physical asynchrony = -100 ms)
then the measured asynchrony is unlikely to fall within the syn- 
chrony window and the observer is unlikely to report a common 
cause (Figure 2C). Calculating the likelihood that a measured 
asynchrony falls within the synchrony window for each physical 
asynchrony at a given level of sensory noise produces the pre- 
dicted behavioral curve (Figure 2D). Because the sensory noise 
is modeled as changing randomly from trial to trial, the model is
probabilistic, calculating the probability of how an observer will 
respond, rather than deterministic, with a fixed response for any 
physical asynchrony (Ma, 2012). 
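
The probabilistic character of the model can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' code: the synchrony window bounds and the noise level below are hypothetical, and the function name is introduced only for this example. On each simulated trial a noisy measurement is drawn around the physical asynchrony, and the response is "synchronous" when the measurement lands inside the fixed window.

```python
import random

def simulate_sync_rate(delta, sigma, window, n_trials=20000, seed=0):
    """Fraction of simulated trials on which the observer reports synchrony.

    delta  : physical asynchrony in ms (positive = visual leads)
    sigma  : standard deviation of the zero-mean sensory noise in ms
    window : (lower, upper) bounds of the synchrony window in ms
    """
    rng = random.Random(seed)
    lower, upper = window
    hits = 0
    for _ in range(n_trials):
        x = rng.gauss(delta, sigma)  # measured asynchrony on this trial
        if lower < x < upper:
            hits += 1
    return hits / n_trials

# Hypothetical window and noise level; not fitted values from the paper.
window = (-117.0, 189.0)
p_visual_lead = simulate_sync_rate(100.0, sigma=100.0, window=window)
p_audio_lead = simulate_sync_rate(-100.0, sigma=100.0, window=window)
print(p_visual_lead, p_audio_lead)
```

The same physical asynchrony yields different responses on different trials, so the model predicts a response probability for each asynchrony, and repeating the calculation across asynchronies traces out the predicted behavioral curve.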

MATERIALS AND METHODS 
BEHAVIORAL TESTING PROCEDURE 

Human subjects approval and subject consent were obtained 
for all experiments. Participants (n = 39) were undergraduates
at Rice University who received course credit. All participants 
reported normal or corrected-to-normal vision and hearing. 

Stimuli were presented on a 15-inch MacBook Pro laptop (2008
model) using MATLAB 2010a with the Psychophysics Toolbox
extensions (Brainard, 1997; Pelli, 1997) running at 1440 × 900
(width × height) resolution. Viewing distance was ~40 cm.
A lamp behind the participants provided low ambient lighting. 
Sounds were presented using KOSS UR40 headphones. The volume
was set at a comfortable level for each individual participant.

FIGURE 2 | Synchrony judgments under the causal inference model.
(A) On each trial, observers obtain a measurement of the audiovisual
asynchrony by differencing measurements of the auditory (magenta) and
visual (green) onsets. Because of sensory noise, the measured asynchrony
(x, blue) is different from the physical asynchrony (Δ, purple). (B) For a given
physical asynchrony, Δ = 100 ms, there is a range of possible measured
asynchronies (x, blue). The shaded region indicates values of x for which
C = 1 is more probable than C = 2 (Figure 1D). The area of the shaded
region is the probability of a synchronous percept, p(Sync). (C) For a
different physical asynchrony, Δ = -100 ms, there is a different distribution
of measured asynchronies, with a lower probability of a synchronous
percept. (D) The probability of a synchronous percept for different physical
asynchronies. Purple markers show the predictions for Δ = 100 ms and
Δ = -100 ms.



Trials began with the presentation of a white fixation cross in 
a central position on the screen for 1.2 s, followed by presentation 
of the audiovisual recording of a single word (~2 s), and then the 
reappearance of the fixation cross until the behavioral response 
was recorded. Participants were instructed to press the "m" key if 
the audio and visual speech were perceived as synchronous, and 
the "n" key if perceived as asynchronous. 

STIMULI 

The stimuli consisted of audiovisual recordings of spoken words 
from previous studies of audiovisual speech synchrony judgments 
(Lachs and Hernandez, 1998; Conrey and Pisoni, 2004) obtained 
by requesting them from the authors. The stimuli were all 
640 × 480 pixels in size. The first stimulus set consisted of recordings
of four words ("doubt," "knot," "loan," "reed") selected to
have high visual intelligibility, as determined by assessing visual-only
identification performance (Conrey and Pisoni, 2004). The
temporal asynchrony of the auditory and visual components of
the recordings was manipulated, ranging from -300 ms (audio
ahead) to +500 ms (video ahead), with 15 total asynchronies
(-300, -267, -200, -133, -100, -67, 0, 67, 100, 133, 200, 267,
300, 400, and 500 ms). The visual-leading half of the curve was
over-sampled because synchrony judgment curves have a peak 
shifted to the right of 0 (Vroomen and Keetels, 2010). The sec- 
ond stimulus set contained blurry versions of these words (at the 
same 15 asynchronies), created by blurring the movies with a 100-pixel
Gaussian filter using Final Cut Pro. For the third stimulus
set, four words with low visual intelligibility were selected ("give," 
"pail," "theme," "voice") at the same 15 asynchronies. The fourth 
stimulus set contained visual-blurred versions of the low visual 
intelligibility words. 

EXPERIMENTAL DESIGN 

For the first experiment, stimuli from all four stimulus sets were 
presented, randomly interleaved, to a group of 16 subjects in one 
testing session for each subject. The testing session was divided
into three runs, with self-paced breaks between runs. Each
run contained one presentation of each stimulus (8 words ×
15 asynchronies × 2 reliabilities = 240 total stimuli per run).
Although the stimulus sets were presented intermixed, and a sin- 
gle model was fit to all stimuli together, we discuss the model 
predictions separately for each of the four stimulus sets. 
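
The trial count per run follows directly from crossing the design factors. The short Python sketch below enumerates one run using the words and asynchronies listed in the Stimuli section; the function name, field names, and the use of a seeded shuffle to stand in for random interleaving are assumptions made for this illustration.

```python
import random
from itertools import product

WORDS = {
    "high": ["doubt", "knot", "loan", "reed"],
    "low": ["give", "pail", "theme", "voice"],
}
ASYNCHRONIES_MS = [-300, -267, -200, -133, -100, -67, 0,
                   67, 100, 133, 200, 267, 300, 400, 500]
RELIABILITIES = ["reliable", "blurred"]

def build_run(seed=None):
    """One run: each word x asynchrony x reliability once, randomly ordered."""
    labeled_words = [(word, level)
                     for level, group in WORDS.items() for word in group]
    trials = [
        {"word": word, "intelligibility": level,
         "offset_ms": offset, "reliability": reliability}
        for (word, level), offset, reliability
        in product(labeled_words, ASYNCHRONIES_MS, RELIABILITIES)
    ]
    random.Random(seed).shuffle(trials)  # stimulus sets are interleaved
    return trials

run = build_run(seed=1)
print(len(run))  # 8 words x 15 asynchronies x 2 reliabilities = 240
```

Each of the four stimulus sets (intelligibility × reliability) contributes 60 trials to a run, that is, 4 words at each of the 15 asynchronies.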

In the second, replication experiment, the same task and stim- 
uli were presented to a group of 23 subjects (no overlap with 
subjects from Experiment 1). These subjects completed one run, 
resulting in one presentation of each stimulus to each subject. Two 
subjects responded "synchronous" to nearly all stimuli (perhaps 
to complete the task as quickly as possible). These subjects were 
discarded, leaving 21 subjects. 

MODEL PARAMETERS 

The causal inference of multisensory speech (CIMS) model has 
two types of parameters: two subject parameters and four stimu- 
lus parameters (one of which is set to zero, resulting in three fitted 
parameters). The first subject parameter, σ, is the noise in the
measurement of the physical asynchrony (while we assume that
the measurement noise is Gaussian, other distributions could be
used easily), and varies across stimulus conditions. Because the
noisy visual and noisy auditory onset estimates are differenced,
only a single parameter is needed to estimate the noise in the measured
asynchrony. As σ increases, the precision of the observer's
measurement of the physical asynchrony decreases.

The second subject parameter is p_{C=1}, the prior probability
of a common cause. This parameter is intended to reflect the
observer's expectation of how often a "common cause" occurs
in the experiment, and remains fixed across stimulus conditions.
Holding other parameters constant, a higher p_{C=1} means that the
observer will more often report synchrony. If observers have no
systematic bias toward C = 1 or C = 2, we have p_{C=1} = 0.5, and
the model has only one subject-level parameter.

The four stimulus parameters reflect the statistics of natural
speech and are the mean and standard deviation of the C = 1
(μ_{C=1}, σ_{C=1}) and C = 2 distributions (μ_{C=2}, σ_{C=2}).

PHYSICAL ASYNCHRONY vs. AUDITORY/VISUAL OFFSET 

In the literature, a common reference frame is to consider the
manipulations made to speech recordings when generating experimental
stimuli (Vroomen and Keetels, 2010). In this convention,
natural speech is assigned an audio/visual offset of zero. These
two reference frames (physical vs. stimulus manipulation) dif- 
fer by a small positive shift that varies slightly across different 
words and talkers. This variability in the C = 1 distribution is 
accounted for in the model with a narrow distribution of physical 
asynchronies. Talkers consistently begin their (visually-observed) 
mouth movements slightly before beginning the auditory part of 
the vocalization. This results in a narrow asynchrony distribution
for C = 1 (small σ_{C=1}). In contrast, if the visually-observed
mouth movements arise from a different talker than the auditory
vocalization, we expect no relationship between them and a broad
asynchrony distribution for C = 2 (large σ_{C=2}). We adopt the
widely-used stimulus manipulation reference frame and define
μ_{C=1} as zero, which leads to a negative value for μ_{C=2} (adopting
the other reference frame does not change the results).

MODEL FITTING AND COMPARISON

All model fitting was done in R (R Core Team, 2012). The source 
code for all models is freely available on the authors' web site 
(http://openwetware.org/wiki/Beauchamp:CIMS). Only a single
model was fit for each subject across all stimulus conditions. The 
input to the model fitting procedures was the number of times 
each physical asynchrony was classified as synchronous across all 
runs. All model parameters were obtained via maximization of 
the binomial log-likelihood function on the observed data. 
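
As an illustration of the objective being maximized, the following Python sketch evaluates a binomial log-likelihood for synchrony counts given a model's predicted probabilities. It is not the authors' R code; the function name, the clipping constant, and the example counts are hypothetical. The 12 trials per asynchrony assume 4 words presented once in each of 3 runs, which is one reading of the Experiment 1 design.

```python
import math

def binomial_log_likelihood(sync_counts, n_trials, predicted_p, eps=1e-9):
    """Sum of binomial log-probabilities across physical asynchronies.

    sync_counts : number of "synchronous" responses at each asynchrony
    n_trials    : number of trials presented at each asynchrony
    predicted_p : model-predicted probability of a synchronous response
    """
    total = 0.0
    for k, n, p in zip(sync_counts, n_trials, predicted_p):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        total += log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

# Made-up counts for six asynchronies, 12 trials each.
sync_counts = [1, 6, 11, 11, 7, 2]
n_trials = [12] * 6
ll_close = binomial_log_likelihood(sync_counts, n_trials,
                                   [0.1, 0.5, 0.9, 0.9, 0.6, 0.15])
ll_flat = binomial_log_likelihood(sync_counts, n_trials, [0.5] * 6)
print(ll_close, ll_flat)
```

Predictions that track the observed proportions give a higher (less negative) log-likelihood than a flat prediction, which is what a fitting routine exploits when searching over parameters.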

For the CIMS model, we used a multi-step optimization 
approach. In the first step, we found the best-fitting subject 
parameters (p_{C=1} and σ) for each subject and stimulus set based
on random initial values for the stimulus parameters (σ_{C=1},
σ_{C=2}, μ_{C=2}). In the second step, we found the best-fitting stimulus
parameters based on the fitted p_{C=1} and σ values. These
steps were repeated until the best-fitting model was obtained.
Because the experimental manipulations were designed to affect
sensory reliability, we fit a separate σ in each condition, resulting
in a total of 8 free parameters for the CIMS model (four σ values,
p_{C=1}, and the three stimulus parameters). This
hierarchical fitting procedure was used because some parame- 
ters were consistent across conditions, allowing the fitting pro- 
cedure to converge on the best fitting model more quickly. 
We refit the model using 256 initial positions for the stimu- 
lus parameters to guard against fitting to local optima. Visual 
inspection of individual fits confirmed the model was not fit- 
ting obviously sub-optimal subject parameters. Finally, we con- 
firmed the ability of the fitting procedure to recover the max- 
imum likelihood estimates for each parameter using simulated 
data. 

We compared the CIMS model with a curve-fitting approach 
taken in previous studies of audiovisual synchrony judgments, 
which we term the Gaussian model. In the Gaussian model, a 
scaled Gaussian probability density curve is fit to each subject's 
synchrony judgment curve. Each subject's synchrony judgment 
curve is therefore characterized by the three parameters: the mean 
value, the standard deviation, and a scale parameter that reflects 
the maximum rate of synchrony perception. Because the Gaussian 
model does not make any predictions about the relationship 
between conditions, we follow previous studies and fit indepen- 
dent sets of parameters between stimulus sets. The Gaussian 
model contains a total of 12 free parameters across all conditions 
in this experiment. 
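
For comparison, the descriptive Gaussian model can be written in a few lines. This Python sketch assumes the curve is parameterized so that the scale parameter equals the peak height (the maximum rate of synchrony reports); the function name and example values are hypothetical.

```python
import math

def gaussian_model(delta, mean, sd, scale):
    """Scaled Gaussian curve: predicted p(synchronous) at asynchrony delta.

    mean  : asynchrony (ms) at which synchrony reports peak
    sd    : width of the curve (ms)
    scale : peak height, i.e., the maximum rate of synchrony reports
    """
    return scale * math.exp(-0.5 * ((delta - mean) / sd) ** 2)

# Three parameters per stimulus set, four stimulus sets.
N_GAUSSIAN_PARAMS = 3 * 4

# Hypothetical parameter values for one stimulus set.
peak = gaussian_model(80.0, mean=80.0, sd=150.0, scale=0.92)
left = gaussian_model(80.0 - 200.0, mean=80.0, sd=150.0, scale=0.92)
right = gaussian_model(80.0 + 200.0, mean=80.0, sd=150.0, scale=0.92)
print(peak, left, right)
```

The curve is symmetric about its mean by construction, so equal departures on the auditory-leading and visual-leading sides always receive equal predicted synchrony rates.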



After fitting the models to the behavioral data, we compared 
them based on how well their predictions matched the observed 
data. We show the mean model predictions with the mean 
behavioral data to assess qualitative fit. Because these predicted 
data are averages, however, they are not a reasonable indicator of 
a model's fit to any individual subject. A model that overpredicts 
synchrony for some subjects and underpredicts for others may 
have a better mean curve than a model that slightly overpredicts 
more often than underpredicts. To ensure the models accurately 
reproduce individual-level phenomena, we assess model fit by 
aggregating error across individual-level model fits. In Experiment 1
we provide a detailed model comparison for each stimulus
set using the Bayesian Information Criterion (BIC). BIC is a linear
transform of the negative log-likelihood that penalizes models
based on the number of free parameters and trials, with lower
BIC values corresponding to better model fits. For each model we
divide the penalty term evenly across the conditions, so that the
CIMS model is penalized for 2 parameters per condition and the
Gaussian model is penalized for 3 parameters per condition. To
compare model fits across all stimulus conditions we considered
both individual BIC across all conditions and the group mean
BIC, calculated by summing the BIC across conditions for each
subject, then taking the mean across subjects. In Experiment 2,
we compare group mean BIC across all conditions. Conventional
significance tests on the BIC differences between models were also
performed. In the model comparison figures, error is the within-subject
standard error, calculated as in Loftus and Masson (1994).
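
The per-condition penalty can be made concrete with a short calculation. The sketch below assumes the standard definition BIC = -2 ln L + k ln n; the function name, the example log-likelihood, and the trial count of 180 per condition (15 asynchronies × 12 trials, one reading of the Experiment 1 design) are assumptions for illustration.

```python
import math

def bic(log_likelihood, n_params, n_trials):
    """Bayesian Information Criterion; lower values indicate a better fit."""
    return -2.0 * log_likelihood + n_params * math.log(n_trials)

# Hypothetical: both models reach the same log-likelihood in one condition.
log_lik = -60.0
n_trials = 180
bic_cims = bic(log_lik, n_params=2, n_trials=n_trials)      # 2 per condition
bic_gaussian = bic(log_lik, n_params=3, n_trials=n_trials)  # 3 per condition
print(bic_gaussian - bic_cims)
```

With equal log-likelihoods the difference is ln(180), about 5.2 BIC units per condition, so the Gaussian model must fit noticeably better than the CIMS model in a condition before its extra parameter pays for itself.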

MODEL DERIVATION 

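In the generative model, there are two possible states of the
world: C = 1 (single talker) and C = 2 (two talkers). The prior
probability of C = 1 is $p_{C=1}$. Asynchrony, denoted $\Delta$, has a
different distribution under each state. Both are assumed to be
Gaussian, such that $p(\Delta \mid C = 1) = N(\Delta; 0, \sigma_1^2)$ and
$p(\Delta \mid C = 2) = N(\Delta; \mu_2, \sigma_2^2)$. The observer's noisy
measurement of $\Delta$, denoted $x$, is also Gaussian,
$p(x \mid \Delta) = N(x; \Delta, \sigma^2)$, where the
variance is the combined variance from the auditory and
visual cues. This specifies the statistical structure of the task.
In the inference model, the observer infers C from x. This is
most easily expressed as a log posterior ratio,

$$d = \log\frac{p(C = 1 \mid x)}{p(C = 2 \mid x)} = \log\frac{p_{C=1}}{1 - p_{C=1}} + \log\frac{N(x; 0, \sigma^2 + \sigma_1^2)}{N(x; \mu_2, \sigma^2 + \sigma_2^2)}.$$

The optimal decision rule is $d > 0$. If we assume that $\sigma_1 < \sigma_2$,
then the optimal decision rule becomes

$$\left| x + \mu_2 \frac{\sigma^2 + \sigma_1^2}{\sigma_2^2 - \sigma_1^2} \right| < \sqrt{\frac{(\sigma^2 + \sigma_1^2)(\sigma^2 + \sigma_2^2)}{\sigma_2^2 - \sigma_1^2}\left(2\log\frac{p_{C=1}}{1 - p_{C=1}} + \log\frac{\sigma^2 + \sigma_2^2}{\sigma^2 + \sigma_1^2} + \frac{\mu_2^2}{\sigma_2^2 - \sigma_1^2}\right)}$$

and the probability of reporting a common cause for a given asynchrony is

$$p(C = 1 \mid \Delta) = \mathrm{Normcdf}(x_{\mathrm{hi}}; \Delta, \sigma) - \mathrm{Normcdf}(x_{\mathrm{lo}}; \Delta, \sigma),$$

where $x_{\mathrm{lo}}$ and $x_{\mathrm{hi}}$ are the lower and upper values of $x$ at which
the inequality above becomes an equality (the edges of the synchrony window).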
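
The derivation can be checked numerically. The Python sketch below is a minimal illustration, not the authors' R implementation: it assumes the log posterior ratio is d = log[p_{C=1}/(1 - p_{C=1})] + log[N(x; 0, σ² + σ₁²)/N(x; μ₂, σ² + σ₂²)], solves d > 0 for x under σ₁ < σ₂ to obtain the synchrony window in closed form, and integrates the measurement distribution over that window. Function names and parameter values are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def log_posterior_ratio(x, sigma, sigma1, sigma2, mu2, p_common=0.5):
    """d = log p(C=1|x) - log p(C=2|x) for a measured asynchrony x."""
    v1 = sigma ** 2 + sigma1 ** 2
    v2 = sigma ** 2 + sigma2 ** 2
    return (math.log(p_common / (1.0 - p_common)) + 0.5 * math.log(v2 / v1)
            - x ** 2 / (2.0 * v1) + (x - mu2) ** 2 / (2.0 * v2))

def synchrony_window(sigma, sigma1, sigma2, mu2, p_common=0.5):
    """Closed-form interval of x where d > 0 (requires sigma1 < sigma2)."""
    if not sigma1 < sigma2:
        raise ValueError("derivation assumes sigma1 < sigma2")
    v1 = sigma ** 2 + sigma1 ** 2
    v2 = sigma ** 2 + sigma2 ** 2
    dv = v2 - v1  # equals sigma2**2 - sigma1**2
    center = -mu2 * v1 / dv
    inner = (2.0 * math.log(p_common / (1.0 - p_common))
             + math.log(v2 / v1) + mu2 ** 2 / dv)
    if inner <= 0:
        return None  # C = 1 is never favored for these parameters
    half_width = math.sqrt(v1 * v2 / dv * inner)
    return center - half_width, center + half_width

def p_report_common_cause(delta, sigma, window):
    """Probability that the noisy measurement falls inside the window."""
    lower, upper = window
    return normal_cdf(upper, delta, sigma) - normal_cdf(lower, delta, sigma)

# Hypothetical values in ms (stimulus reference frame).
params = dict(sigma=100.0, sigma1=60.0, sigma2=150.0, mu2=-50.0)
window = synchrony_window(**params)
p_plus = p_report_common_cause(100.0, params["sigma"], window)
p_minus = p_report_common_cause(-100.0, params["sigma"], window)
print(window, p_plus, p_minus)
```

Evaluating `log_posterior_ratio` at either edge of the returned window gives a value of zero up to rounding error, which is a quick consistency check between the decision rule and the closed-form bounds.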



RESULTS 

BEHAVIORAL RESULTS FROM EXPERIMENT 1 

The CIMS model makes trial-to-trial behavioral predictions 
about synchrony perception using a limited number of
parameters that capture physical properties of speech, the sensory
noise of the subject, and the subject's prior assumptions
about the causal structure of the stimuli in the experiment. The 
Gaussian model simply fits a Gaussian curve to the behavioral 
data. We tested the CIMS model and Gaussian model against 
behavioral data from subjects viewing movie clips of audiovisual 
speech with varying asynchrony, visual intelligibility (high/low) 
and visual reliability (reliable/blurred). 

Visual reliable, high visual intelligibility words 

We first compared synchrony judgments for 16 subjects with
visual reliable, high visual intelligibility words. The proportion of
synchronous responses was 0.90 or higher for visual-leading asynchronies
from +67 to +133 ms, but dropped off for higher or lower
asynchronies (Figure 3A). The general shape of the curve is consistent
with previous reports of simultaneity judgments with these
stimuli (Conrey and Pisoni, 2006).

A single CIMS model and Gaussian model were fit to each sub- 
ject across all stimulus conditions and the predicted synchrony 
reports were averaged to produce mean predictions (Figure 3A). 
To provide a quantitative comparison of the model fits, we compared
the BIC of both models for each subject (Figure 3B). The
BIC measure was in favor of the CIMS model, with a mean
difference of 5.8 ± 1.8 (SEM). A paired t-test showed that the
difference was reliable [t(15) = 3.16, p = 0.006].

The better fit of the CIMS model is caused by its ability
to predict a range of asynchronies that are perceived as nearly
synchronous (rather than one peak) and an asymmetric synchrony
judgment curve. These features are consequences of the
model structure, not explicit parameters of the model. The presence
of noise in the sensory system means that even when the
physical asynchrony is identical to the mean of the common
cause distribution there is still a chance the measured asynchrony
will be outside the synchrony window. Having an asymmetric,
broad range of asynchronies reported as nearly synchronous is predicted
by the interaction of the observer's prior belief about the
prevalence of a common cause in this experiment and sensory
noise.

Visual blurred, high visual intelligibility words 

Because blurring decreases the reliability of the visual speech,
the CIMS model predicts that the sensory noise level should
increase, resulting in changes in synchrony perception primarily
at larger asynchronies. Despite the blurring, the peak of the
synchrony judgment curve remained high (around 0.9 reported
synchrony for +67 to +133 ms), showing that participants were
still able to perform the task. However, the distribution had a flatter
top, with participants reporting high synchrony values for a
broad range of physical asynchronies that extended from 0 ms
(no audio/visual offset) to +267 ms (Figure 3C). The drop-off
in reported synchrony was more asymmetric than for unblurred
stimuli, dropping more slowly for the visual-leading side of the
curve. Comparing the model fits to the behavioral data, the CIMS
model was supported (Figure 3D) over the Gaussian model [BIC
difference: 4.1 ± 1.3, t(15) = 3.1, p = 0.007]. Blurring the visual
speech adds uncertainty to the observer's estimate of the visual
onset. For the CIMS model, this has the effect of increasing
variability, leading to a widening of the predicted behavioral
curves and an exaggeration of their asymmetry. In contrast, for
the Gaussian model, increasing the standard deviation can only
symmetrically widen the fitted curves.

Visual reliable, low visual intelligibility words 

In the CIMS model, words with low visual intelligibility should
decrease the certainty of the visual speech onset, corresponding
to an increase in sensory noise. Decreasing visual intelligibility
both widened and flattened the peak of the synchrony judgment
curve (Figure 4A), resulting in a broad plateau from -67 ms to
+133 ms. When the CIMS and Gaussian model fits were compared
(Figure 4B), the CIMS model provided a better fit to
the behavioral data [BIC difference: 9.9 ± 2.0, t(15) = 4.9,
p < 0.001]. The CIMS model accurately predicts the plateau observed
in the behavioral data, in which a range of small asynchronies
are reported as synchronous with high probability. The Gaussian
model attempts to fit this plateau through an increased standard
deviation, but this resulted in over-estimating synchrony reports
at greater asynchronies.

Visual blurred, low visual intelligibility words 

The CIMS model predicts that visual blurring should be captured
solely by a change in sensory noise. Blurring led to an
increase in synchrony reports primarily for the larger asynchronies
(Figure 4C). The height and location of the plateau
of the curve was similar to the unblurred versions of these
words (between 0.89 and 0.93 synchrony reports from -67 ms
to +133 ms). Overall, blurring the low visual intelligibility words
had generally the same effect as blurring the high visual intelligibility
words: widening the synchrony judgment curve, but
not changing the height or position of the curve's plateau. The
CIMS model fit the behavioral data better (Figure 4D) than
the Gaussian model [BIC difference: 4.7 ± 1.2, t(15) = 3.9,
p = 0.001]. The better fit resulted from the CIMS model's ability to
predict an asymmetric effect of blurring and continued prediction
of a wide range of asynchronies reported as synchronous with
high probability.

FIGURE 3 | Model fits to behavioral data for Experiment 1 (high visual
intelligibility words). (A) Black circles show the behavioral data from 16
subjects performing a synchrony judgment task (mean ± standard error) for
each stimulus asynchrony with visual reliable, high visual intelligibility
stimuli. Curves show the model predictions for the CIMS model (orange)
and Gaussian model (blue). (B) Fit error measured with Bayesian
Information Criterion (BIC) for the CIMS and Gaussian models; lower values
indicate a better fit for the CIMS model (**p = 0.006). Error bars show
within-subject standard error (Loftus and Masson, 1994). (C) Mean
proportion of synchrony responses and model predictions for visual blurred,
high visual intelligibility stimuli. (D) Fit error for the CIMS and Gaussian
models, showing better fit for the CIMS model (**p = 0.007).

FIGURE 4 | Model fits to behavioral data for Experiment 1 (low visual
intelligibility words). (A) Black circles show the behavioral data with visual
reliable, low visual intelligibility stimuli; curves show model predictions for
CIMS (orange) and Gaussian (blue) models. (B) Fit error showing
significantly better fit for the CIMS model (**p < 0.001). (C) Mean
proportion of synchrony responses and model predictions for visual blurred,
low visual intelligibility stimuli. (D) Fit error showing significantly better fit
for the CIMS model (**p = 0.001).

Overall model testing 

Next, we compared the models across all stimulus sets together.
For the CIMS model, the parameters σ_{C=1}, σ_{C=2}, μ_{C=2}, and p_{C=1}
for each subject are fit across all conditions, placing constraints
on how much synchrony perception can vary across conditions
(the Gaussian model has no such constraints because a separate
scaled Gaussian is fit for each condition). We compared
the models across all conditions using the average of the total
BIC (summed across stimulus sets) across subjects (Figure 5A).
Despite the additional constraints of the CIMS model, the overall
model test supported it over the Gaussian model [BIC difference:
24.5 ± 4.9, t(15) = 4.99, p < 0.001]. The direction of this difference
was replicated across all 16 subjects (Figure 5B), although
the magnitude showed a large range (range of BIC differences:
2 to 81).

It is tempting to compare the fit error across different stimu- 
lus conditions within each model to note, for instance, that the 
fit error for visual blurring is greater than without it; or that the 
models fit better with low visual intelligibility words. However, 
these comparisons must be made with caution because the stim- 
ulus level parameters are fit across all conditions simultaneously 
and are thus highly dependent. For instance, removing the visual 
intelligibility manipulation would change the parameter esti- 
mates and resulting fit error for the visual blurring condition. 
The only conclusion that can be safely drawn is that the CIMS
model provides a better fit than the Gaussian model for all tested 
stimulus manipulations. 

Interpreting parameters from the CIMS model 

A key property of the CIMS model is the complete specification
of the synchrony judgment task structure, so that model parameters
may have a meaningful link to the cognitive and neural
processes that instantiate them. First, we examined how σ, the
sensory noise parameter, changed across stimulus conditions and
subjects (Figure 6).

The fitted value of σ is a measure of the sensory noise level
in each individual and condition, and captures individual differences
in the task. To demonstrate the within-subject relationship
across conditions, we correlated the σ for the high and low
visual intelligibility conditions across subjects. We found a very
high correlation (Figures 6A,B) for both the blurred stimuli (r =
0.92, p < 10"*) and the unblurred stimuli (r = 0.95, p < 10"^).
This correlation demonstrates that subjects have a consistent 
level of sensory noise: some subjects have a low level of sensory 
noise across different stimulus manipulations, while others have a 
higher level. In our study, the subjects were healthy controls, leading
to a modest range of σ across subjects (70-300 ms). Although
we fit the model with a restricted range of stimuli (congruent 
audiovisual words with high or low visual reliability and intelli- 
gibility), subjects with high sensory noise might show differences 
from subjects with low sensory noise across a range of multi- 
sensory speech tasks, such as perception of the McGurk effect 
(Stevenson et al., 2012). In clinical populations, subjects with lan- 
guage impairments (such as those with ASD or dyslexia) might be 
expected to have higher sensory noise values and a larger variance 
across subjects. 

Next, we examined the average value of σ in different conditions
(Figure 6C). Blurring the visual speech should lead to an
increase in sensory noise, as the blurred stimuli provide less reli- 
able information about the onset time of the visual speech. Words 
with lower visual intelligibility should also cause an increase in 
sensory noise, as the visual speech information is more ambiguous
and its onset harder to estimate. As expected, σ was higher
for blurred stimuli (mean increase of 10 ms) and low visual intelligibility
words (mean increase of 12 ms). A repeated measures
ANOVA on the fitted σ values with visual reliability (reliable or
blurred) and visual intelligibility (high or low) as factors showed
a marginally reliable effect of visual reliability [F(1, 15) = 4.51,
p = 0.051], a main effect of visual intelligibility [F(1, 15) = 7.16,
p = 0.020] and no interaction.

An additional parameter of the CIMS model is p_{C=1}, which
represents the observer's prior belief that audio and visual speech 
events arise from a common cause. Higher values indicate a 
higher probability of inferring a common cause (and there- 
fore of responding synchronous). Across all stimulus conditions, 
subjects' priors were biased toward reporting one cause [mean
p_{C=1} = 0.58; t-test against 0.5; t(15) = 4.47, p < 0.001]. A prior
biased toward reporting a common cause may be due to the pre- 
sentation of a single movie clip in each trial and the same talker 
across trials. Having a high prior for C = 1 increases the proba- 
bility of responding synchronous even for very high asynchronies, 
leading to the observed behavioral effect of non-zero reported 
synchrony even at very large asynchronies. 
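
To make the role of the prior concrete, the following sketch computes the posterior probability of a common cause for a given measured asynchrony. It assumes Gaussian asynchrony distributions for each causal structure and independent Gaussian sensory noise, so that the noise variance adds to each distribution's variance. The default stimulus parameters are the group means reported in this section; the sensory noise of 100 ms and the function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_common_cause(x, prior, sigma,
                           mu1=0.0, sd1=65.0, mu2=-48.0, sd2=126.0):
    """P(C = 1 | measured asynchrony x), via Bayes' rule.

    Sensory noise (sigma) adds in quadrature to the spread of each
    causal-structure distribution.
    """
    like1 = normal_pdf(x, mu1, math.hypot(sd1, sigma))
    like2 = normal_pdf(x, mu2, math.hypot(sd2, sigma))
    return prior * like1 / (prior * like1 + (1.0 - prior) * like2)

# Compare an unbiased prior with the group mean prior of 0.58.
for prior in (0.50, 0.58):
    values = [posterior_common_cause(x, prior, sigma=100.0)
              for x in (0.0, 200.0, 400.0)]
    print(prior, [round(v, 2) for v in values])
```

For any fixed measurement, raising the prior raises the posterior for a common cause, which is the direction of the effect described above.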

Finally, we examined model parameters that relate to the natural
statistics of audiovisual speech. Across all participants, the
standard deviations of the common and separate cause distributions
were estimated to be σ_{C=1} = 65 ± 9 ms (SEM) and σ_{C=2} =
126 ± 12 ms. For consistency with the literature, we used the
stimulus manipulation reference frame and fixed μ_{C=1} at zero,
resulting in a fitted value for μ_{C=2} of -48 ± 12 ms (using the
physical asynchrony reference frame would result in a value for
μ_{C=2} near zero and a positive value for μ_{C=1}).

EXPERIMENT 2 

Results from the first experiment demonstrated that the CIMS 
model describes audiovisual synchrony judgments better than the
Gaussian model under manipulations of temporal asynchrony,
visual blurring, and reduced visual intelligibility. One notable dif- 
ference between the two models is the use of parameters in the 
CIMS model that are designed to reflect aspects of the natu- 
ral statistics of audiovisual speech. If these stimulus parameters 
are reflective of natural speech statistics, they should be rela- 
tively consistent across different individuals tested with the same 
stimuli. To test this assertion, we fit the CIMS model to an independent
set of 21 subjects using the mean values from Experiment
1 for the stimulus parameters (σ_{C=1}, σ_{C=2}, and μ_{C=2}; μ_{C=1}
remained fixed at zero). We then compared the fits of this reduced
CIMS model with the fits from the Gaussian model using the 
behavioral data from the 21 subjects. In experiment 2, the CIMS 
model has 7 fewer parameters per subject than the Gaussian model 
(5 vs. 12). 

Behavioral results and model testing 

Overall behavioral results were similar to Experiment 1. The BIC
measure favored the CIMS model both for the group (BIC difference:
33.8 ± 1.3; Figure 5C) and in each of the 21 subjects
(Figure 5D). A paired t-test on the BIC values confirmed that
CIMS was the better fitting model [t(20) = 25.70, p < 10"^^].
There is a noticeable difference in the average total BIC between
experiments. Because the calculation of BIC scales with the number
of trials, with 4 trials per subject in Experiment 2 and 12 trials
per subject in Experiment 1, the magnitudes will necessarily be 
larger in Experiment 1 and cannot be directly compared. 
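
The reported test statistics can be roughly cross-checked against the summary values. The sketch below assumes that the ± terms attached to the BIC differences are standard errors of the mean (the text does not say so explicitly for these values); under that assumption, the paired t statistic is the mean difference divided by its standard error. The function name is hypothetical.

```python
def paired_t_from_summary(mean_diff, sem_diff):
    """t statistic for a paired t-test from the mean difference and its SEM."""
    return mean_diff / sem_diff

# Summary values quoted in the text (BIC difference, mean ± assumed SEM).
t_exp1 = paired_t_from_summary(24.5, 4.9)  # text reports a t value of 4.99
t_exp2 = paired_t_from_summary(33.8, 1.3)  # text reports a t value of 25.70
print(round(t_exp1, 2), round(t_exp2, 2))
```

Both ratios land close to the reported statistics; small discrepancies are expected because the published means and standard errors are rounded.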

The better fit for the CIMS model in this experiment shows 
that the model is reproducing essential features of synchrony per- 
ception with fewer parameters than the Gaussian model. If both 
models were simply curve-fitting, we would expect the model 
with more free parameters to perform better. Instead, the CIMS 
model makes explicit predictions that some parameters should 
remain fixed across conditions and provides an explanation for 
the shape of the synchrony judgment data. 

DISCUSSION 

The CIMS model prescribes how observers should combine infor- 
mation from multiple cues in order to optimally perceive audiovi- 
sual speech. Our study builds on previous examinations of causal 
inference during audio-visual multisensory integration but pro- 
vides important advances. Previous work has demonstrated that 
causal inference can explain behavioral properties of audiovi- 
sual spatial localization using simple auditory beeps and visual 
flashes (Kording et al., 2007; Sato et al., 2007). We use the same
theoretical framework, but for a different problem, namely the 
task of deciding if two speech cues are synchronous. Although 
the problems are mathematically similar, they are likely to be 
subserved by different neural mechanisms. For instance, audiovi- 
sual spatial localization likely occurs in the parietal lobe (Zatorre 
et al., 2002) while multisensory speech perception is thought 
to occur in the superior temporal sulcus (Beauchamp et al.,
2004). Different brain areas might solve the causal inference prob- 
lem in different ways, and these different implementations are 
likely to have behavioral consequences. For instance, changing 
from simple beep/flash stimuli to more complex speech stimuli
can change the perception of simultaneity (Love et al., 2013;
Stevenson and Wallace, 2013). Because multisensory speech may
be the most ethologically important sensory stimulus, it is crit- 
ical to develop and test the framework of causal inference for 
multisensory speech perception. 

The CIMS model shows how causal inference on auditory 
and visual signals can provide a mechanistic understanding of 
how humans judge multisensory speech synchrony. The model 
focuses on the asynchrony between the auditory and visual speech 
cues because of the prevalence of synchrony judgment tasks in 
the audiovisual speech literature and its utility in characteriz- 
ing speech perception in healthy subjects and clinical populations 
(Lachs and Hernandez, 1998; Conrey and Pisoni, 2006; Smith and 
Bennetto, 2007; Rouger et al., 2008; Foss-Feig et al., 2010; Navarra 
et al., 2010; Stevenson et al., 2010; Vroomen and Keetels, 2010).
A key feature of the CIMS model is that it is based on a principled 
analysis of how an optimal observer should solve the synchrony 
judgment problem. This feature allows it to make predictions 
(such as the stimulus-level parameters remaining constant across 
groups of observers) that can never be made by post-hoc curve- 
fitting procedures, Gaussian or otherwise. Hence, the model acts 
as a bridge between primarily empirical studies that examine 
subjects' behavior (Wallace et al., 2004; Navarra et al., 2010; 
Hillock-Dunn and Wallace, 2012) under a variety of different 
multisensory conditions and more theoretical studies that focus 
on Bayes-optimal models of perception (Kording et al., 2007; 
Shams and Beierholm, 2010). 

Across manipulations of visual reliability and visual intel- 
ligibility, the model fit better than the Gaussian curve-fitting 
approach, even when it had many fewer parameters. Unlike 
the Gaussian approach, the parameters of the CIMS model are 
directly related to the underlying decision rule. These parame- 
ters, such as the subject's sensory noise, beliefs about the task, and 
structural knowledge of audiovisual speech can be used to char- 
acterize individual and group differences in multisensory speech 
perception. 

An interesting observation is that the individual differences
in sensory noise across subjects (range of 70-300 ms) were much
greater than the change in sensory noise within individuals caused
by stimulus manipulations (~30 ms). This means that results
from only one condition may be sufficient to study individual 
differences in synchrony perception. In some populations, it is 
prohibitively difficult to collect a large number of trials in many 
separate conditions. A measure of sensory noise that is obtainable 
from only one condition could therefore be especially useful for 
studying these populations. 

CAUSAL INFERENCE PREDICTS FEATURES OF AUDIOVISUAL SPEECH 
SYNCHRONY JUDGMENTS 

The CIMS model explains synchrony perception as an inference 
about the causal relationship between two events. Several features 
of synchrony judgment curves emerge directly from the compu- 
tation of this inference process. First, the presence of uncertainty 
in the sensory system leads to a broad distribution of synchrony 
responses rather than a single peak near μ_{C=1}. When an observer
hears and sees a talker, the measured asynchrony is corrupted by 
sensory noise. The optimal observer takes this noise into account 
and makes an inference about the likelihood that the auditory
and visual speech arose from the same talker, and therefore,
are synchronous. This decision process can lead to an overall 
synchrony judgment curve with a noticeably flattened peak, as 
observed behaviorally. 

Second, the rightward shift (toward visual-leading asyn- 
chronies) of the maximal point of synchrony is explained 
by the natural statistics of audiovisual speech coupled with 
noise in the sensory system. Because the mean of the common
cause distribution (μ_{C=1}) is over the visual-leading asynchronies,
small positive asynchronies are more consistent with
a common cause than small negative asynchronies. This fea- 
ture of the synchrony judgment curve is enhanced by the 
location of the C = 2 distribution at a physical asynchrony 
of 0 ms. 

WHAT ABOUT THE TEMPORAL BINDING WINDOW AND THE MEAN 
POINT OF SYNCHRONY? 

The Gaussian model is used to obtain measures of the tem- 
poral binding window and mean point of synchrony in order 
to compare individuals and groups. In our formulation of the 
CIMS model, we introduced the Bayes-optimal synchrony win- 
dow. This synchrony window should not be confused with the 
temporal binding window. The temporal binding window is pred- 
icated on the idea that observers have access to the physical 
asynchrony of the stimulus, which cannot be correct: observers 
only have access to a noisy representation of the world. The 
CIMS model avoids this fallacy by defining a synchrony win- 
dow based on the observer's noisy measurement of the physical 
asynchrony. The predicted synchrony reports from the CIMS 
model therefore relate to the probability that a measured asyn- 
chrony will land within the Bayes-optimal window, not whether 
a physical asynchrony is sufficiently small. This distinction is a 
critical difference between the generative modeling approach of 
the CIMS model and the curve-fitting approach of the Gaussian 
model. 
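
The distinction can be made concrete with a minimal sketch, assuming Gaussian common-cause and separate-cause distributions and Gaussian sensory noise. Under those assumptions the log posterior odds are quadratic in the measured asynchrony, so the window edges are the roots of a quadratic, and the predicted report probability is the chance that a noisy measurement lands between them. The default stimulus parameters are the group means reported for Experiment 1; the prior, the 100 ms noise level, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math

def synchrony_window(prior, sigma, mu1=0.0, sd1=65.0, mu2=-48.0, sd2=126.0):
    """Measured asynchronies (ms) for which P(C = 1 | x) > 0.5.

    The window edges are the two roots of a*x**2 + b*x + c = 0, the log
    posterior odds. Assumes sd1 < sd2 (so a < 0) and that a window
    exists (positive discriminant).
    """
    v1 = sd1 ** 2 + sigma ** 2  # variance of x under a common cause
    v2 = sd2 ** 2 + sigma ** 2  # variance of x under separate causes
    a = 0.5 / v2 - 0.5 / v1
    b = mu1 / v1 - mu2 / v2
    c = (mu2 ** 2 / (2 * v2) - mu1 ** 2 / (2 * v1)
         + math.log(prior / (1 - prior)) + 0.5 * math.log(v2 / v1))
    root = math.sqrt(b * b - 4 * a * c)
    edges = sorted(((-b + root) / (2 * a), (-b - root) / (2 * a)))
    return edges[0], edges[1]

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_report_synchronous(asynchrony, prior, sigma):
    """Probability that a noisy measurement of a physical asynchrony
    falls inside the Bayes-optimal synchrony window."""
    lo, hi = synchrony_window(prior, sigma)
    return (normal_cdf((hi - asynchrony) / sigma)
            - normal_cdf((lo - asynchrony) / sigma))

lo, hi = synchrony_window(prior=0.58, sigma=100.0)
print(f"window: {lo:.0f} to {hi:.0f} ms")
for delta in (-300, 0, 100, 500):
    print(delta, round(p_report_synchronous(delta, 0.58, 100.0), 2))
```

With these values the window is not centered on zero but extends further toward positive asynchronies, and the predicted curve is a smooth function of the physical asynchrony even though the window itself has hard edges.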

In the CIMS model, the shape of the behavioral curve emerges 
naturally from the assumptions of the model, and is a result of 
interactions between all model parameters. In contrast, the mean 
point of synchrony in the Gaussian model defines a single value 
of the behavioral data. This poses a number of problems. First, 
the behavioral data often show a broad plateau, meaning that a 
lone "peak" mean point of synchrony fails to capture a prominent 
feature of the behavioral data. Second, the location of the cen- 
ter of the behavioral data is not a fixed property of the observer, 
but reflects the contributions of prior beliefs, sensory noise, and 
stimulus characteristics. By separately estimating these contri- 
butions, the CIMS model can make predictions about behavior 
across experiments. 

MODIFICATIONS TO THE GAUSSIAN MODEL 

The general form of the Gaussian model used in this paper has 
been used in many published studies on synchrony judgments 
(Conrey and Pisoni, 2006; Navarra et al., 2010; Vroomen and 
Keetels, 2010; Baskent and Bazo, 2011; Love et al., 2013). It is possible
to modify the Gaussian model used in this paper to improve
its fit to the behavioral data, for instance by fitting each side 
of the synchrony judgment curve with separate Gaussian curves
(Powers et al., 2009; Hillock et al., 2011; Stevenson et al., 2013). In
the current study, the Gaussian model required 12 parameters per 
subject; fitting each half of the synchrony judgment curve would 
require 20 parameters per subject, more than twice the number 
of parameters in the CIMS model. Additionally, the CIMS model 
required only 4 parameters to characterize changes across exper- 
imental conditions. Although researchers may continue to add 
more flexibility to the Gaussian model to increase its fit to the 
behavioral data (Stevenson and Wallace, 2013), the fundamen- 
tal problem remains: the only definition of model goodness is 
that it fits the behavioral data "better." In the limit, a model with 
as many parameters as data points can exactly fit the data. Such 
models are incapable of providing a deeper understanding of the 
underlying properties of speech perception because their param- 
eters have no relationship with underlying cognitive or neural 
processes. 
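
The limiting case mentioned above can be illustrated with a small sketch. It uses polynomial (Lagrange) interpolation as a stand-in for any sufficiently flexible curve; this is not a model from the synchrony literature, and the data values are hypothetical.

```python
def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through all points.

    It has as many free coefficients as there are data points.
    """
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

# Hypothetical proportions of "synchronous" reports at five asynchronies (ms).
asynchronies = [-300.0, -100.0, 0.0, 100.0, 300.0]
p_sync = [0.15, 0.70, 0.90, 0.95, 0.40]

model = lagrange_fit(asynchronies, p_sync)
max_error = max(abs(model(x) - y) for x, y in zip(asynchronies, p_sync))
print(f"largest fit error: {max_error:.2e}")
print(f"prediction at 200 ms: {model(200.0):.2f}")  # unconstrained by theory
```

The fit error at the observed points vanishes, yet none of the five coefficients corresponds to a property of the observer or the stimulus.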

Researchers using the Gaussian model could also try to reduce 
the number of free parameters by fixing certain parameters across 
conditions, or using values estimated from independent samples.
This approach is necessarily post-hoc, providing no rationale
about which parameters should remain fixed (or allowed to vary) 
before observing the data. 

CONCLUSION 

The CIMS model affords a quantitative grounding for 
multisensory speech perception that recognizes the fundamental
role of causal inference in general multisensory integration
(Kording et al., 2007; Sato et al., 2007; Schutz and Kubovy, 2009;
Shams and Beierholm, 2010; Buehner, 2012). Most importantly, 
the CIMS model represents a fundamental departure from curve- 
fitting approaches. Rather than focusing on mimicking the shape 
of the observed data, the model suggests how the data are gener- 
ated through a focus on the probabilistic inference problem that 
underlies synchrony perception. The model parameters thus pro- 
vide principled measures of multisensory speech perception that 
can be used in healthy and clinical populations. More generally, 
causal inference models make no assumptions about the nature 
of the stimuli being perceived, but provide strong predictions 
about multisensory integration across time and space. While the 
current incarnation of the CIMS model considers only tempo- 
ral asynchrony, the theoretical framework is amenable to other 
cues that may be used to make causal inference judgments dur- 
ing multisensory speech perception, such as the spatial location 
of the auditory and visual cues, the gender of the auditory and 
visual cues, or the speech envelope. 

ACKNOWLEDGMENTS 

This research was supported by NIH R01NS065395 to Michael S. 
Beauchamp. The authors are grateful to Debshila Basu Mallick, 
Haley Lindsay, and Cara Miekka for assistance with data collec- 
tion and to Luis Hernandez and David Pisoni for sharing the 
asynchronous video stimuli. 



REFERENCES 

Baskent, D., and Bazo, D. (2011). Audiovisual asynchrony detection
and speech intelligibility in noise with moderate to severe
sensorineural hearing impairment. Ear Hear. 32, 582-592. doi:
10.1097/AUD.0b013e31820fca23

Beauchamp, M. S., Argall, B. D., Bodurka, J., Duyn, J. H., and
Martin, A. (2004). Unraveling multisensory integration: patchy
organization within human STS multisensory cortex. Nat. Neurosci.
7, 1190-1192. doi: 10.1038/nn1333

Bejjanki, V. R., Clayards, M., Knill, D. C., and Aslin, R. N. (2011).
Cue integration in categorical tasks: insights from audio-visual speech
perception. PLoS ONE 6:e19812. doi: 10.1371/journal.pone.0019812

Brainard, D. H. (1997). The psychophysics toolbox. Spat. Vis.
10, 433-436. doi: 10.1163/156856897X00357

Buehner, M. J. (2012). Understanding the past, predicting the future:
causation, not intentional action, is the root of temporal binding.
Psychol. Sci. 23, 1490-1497. doi: 10.1177/0956797612444612

Conrey, B., and Pisoni, D. B. (2006). Auditory-visual speech perception
and synchrony detection for speech and nonspeech signals. J. Acoust.
Soc. Am. 119, 4065-4073. doi: 10.1121/1.2195091

Conrey, B. L., and Pisoni, D. B. (2004). "Detection of auditory-visual
asynchrony in speech and nonspeech signals," in Research on Spoken
Language Processing Progress Report No 26, (Bloomington, IN: Indiana
University), 71-94.

Foss-Feig, J. H., Kwakye, L. D., Cascio, C. J., Burnette, C. P.,
Kadivar, H., Stone, W. L., et al. (2010). An extended multisensory
temporal binding window in autism spectrum disorders. Exp. Brain
Res. 203, 381-389. doi: 10.1007/s00221-010-2240-4

Hillock, A. R., Powers, A. R., and Wallace, M. T. (2011). Binding
of sights and sounds: age-related changes in multisensory temporal
processing. Neuropsychologia 49, 461-467. doi:
10.1016/j.neuropsychologia.2010.11.041

Hillock-Dunn, A., and Wallace, M. T. (2012). Developmental changes in
the multisensory temporal binding window persist into adolescence.
Dev. Sci. 15, 688-696. doi: 10.1111/j.1467-7687.2012.01171.x

Kording, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B.,
and Shams, L. (2007). Causal inference in multisensory perception.
PLoS ONE 2:e943. doi: 10.1371/journal.pone.0000943

Lachs, L., and Hernandez, L. R. (1998). "Update: the Hoosier audiovisual
multitalker database," in Research on Spoken Language Processing Progress
Report No 22, (Bloomington, IN), 377-388.

Loftus, G. R., and Masson, M. E. (1994). Using confidence intervals
in within-subject designs. Psychon. Bull. Rev. 1, 476-490. doi:
10.3758/BF03210951

Love, S. A., Petrini, K., Cheng, A., and Pollick, F. E. (2013). A
psychophysical investigation of differences between synchrony and
temporal order judgments. PLoS ONE 8:e54798. doi:
10.1371/journal.pone.0054798

Ma, W. J. (2012). Organizing probabilistic models of perception.
Trends Cogn. Sci. 16, 511-518. doi: 10.1016/j.tics.2012.08.010

Ma, W. J., Zhou, X., Ross, L. A., Foxe, J. J., and Parra, L. C. (2009).
Lip-reading aids word recognition most in moderate noise: a Bayesian
explanation using high-dimensional feature space. PLoS ONE 4:e4638.
doi: 10.1371/journal.pone.0004638

Massaro, D. W. (1989). Testing between the TRACE model and the fuzzy
logical model of speech perception. Cogn. Psychol. 21, 398-421. doi:
10.1016/0010-0285(89)90014-5

Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T.
(2001). Bayes factor of model selection validates FLMP.
Psychon. Bull. Rev. 8, 1-17. doi: 10.3758/BF03196136

Navarra, J., Alsius, A., Velasco, I., Soto-Faraco, S., and Spence, C.
(2010). Perception of audiovisual speech synchrony for native and
non-native language. Brain Res. 1323, 84-93. doi:
10.1016/j.brainres.2010.01.059

Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics:
transforming numbers into movies. Spat. Vis. 10, 437-442.
doi: 10.1163/156856897X00366

Powers, A. R. 3rd., Hillock, A. R., and Wallace, M. T. (2009). Perceptual
training narrows the temporal window of multisensory binding.
J. Neurosci. 29, 12265-12274. doi: 10.1523/JNEUROSCI.3501-09.2009

R Core Team. (2012). R: A Language and Environment for Statistical
Computing. Vienna: R Foundation for Statistical Computing.

Rosenblum, L. D., Johnson, J. A., and Saldana, H. M. (1996).
Point-light facial displays enhance comprehension of speech in noise.
J. Speech Hear. Res. 39, 1159-1170.

Rouger, J., Fraysse, B., Deguine, O., and Barone, P. (2008). McGurk
effects in cochlear-implanted deaf subjects. Brain Res. 1188, 87-99.
doi: 10.1016/j.brainres.2007.10.049






Sato, Y., Toyoizumi, T., and Aihara, K. (2007). Bayesian inference
explains perception of unity and ventriloquism aftereffect:
identification of common sources of audiovisual stimuli. Neural
Comput. 19, 3335-3355. doi: 10.1162/neco.2007.19.12.3335

Schutz, M., and Kubovy, M. (2009). Causality and cross-modal
integration. J. Exp. Psychol. Hum. Percept. Perform. 35, 1791-1810.
doi: 10.1037/a0016455

Schwartz, J. L., Berthommier, F., and Savariaux, C. (2004). Seeing to
hear better: evidence for early audio-visual interactions in speech
identification. Cognition 93, B69-B78. doi:
10.1016/j.cognition.2004.01.006

Shams, L., and Beierholm, U. R. (2010). Causal inference in perception.
Trends Cogn. Sci. 14, 425-432. doi: 10.1016/j.tics.2010.07.001

Smith, E. G., and Bennetto, L. (2007). Audiovisual speech integration
and lipreading in autism. J. Child Psychol. Psychiatry 48,
813-821. doi: 10.1111/j.1469-7610.2007.01766.x

Stevenson, R. A., Altieri, N. A., Kim, S., Pisoni, D. B., and
James, T. W. (2010). Neural processing of asynchronous audiovisual
speech perception. Neuroimage 49, 3308-3318. doi:
10.1016/j.neuroimage.2009.12.001

Stevenson, R. A., and Wallace, M. T. (2013). Multisensory temporal
integration: task and stimulus dependencies. Exp. Brain Res. 227,
249-261. doi: 10.1007/s00221-013-3507-3

Stevenson, R. A., Wilson, M. M., Powers, A. R., and Wallace, M. T.
(2013). The effects of visual training on multisensory temporal
processing. Exp. Brain Res. 225, 479-489. doi: 10.1007/s00221-012-3387-y

Stevenson, R. A., Zemtsov, R. K., and Wallace, M. T. (2012). Individual
differences in the multisensory temporal binding window predict
susceptibility to audiovisual illusions. J. Exp. Psychol. Hum.
Percept. Perform. 38, 1517-1529. doi: 10.1037/a0027339

Sumby, W. H., and Pollack, I. (1954). Visual contribution to speech
intelligibility in noise. J. Acoust. Soc. Am. 26, 212-215. doi:
10.1121/1.1907309

Vroomen, J., and Keetels, M. (2010). Perception of intersensory
synchrony: a tutorial review. Atten. Percept. Psychophys. 72, 871-884.
doi: 10.3758/APP.72.4.871

Wallace, M. T., Roberson, G. E., Hairston, W. D., Stein, B. E.,
Vaughan, J. W., and Schirillo, J. A. (2004). Unifying multisensory
signals across time and space. Exp. Brain Res. 158, 252-258. doi:
10.1007/s00221-004-1899-9

Zatorre, R. J., Bouffard, M., Ahad, P., and Belin, P. (2002). Where
is 'where' in the human auditory cortex? Nat. Neurosci. 5, 905-909.
doi: 10.1038/nn904

Conflict of Interest Statement: The authors declare that the research
was conducted in the absence of any commercial or financial relationships
that could be construed as a potential conflict of interest.

Received: 20 August 2013; accepted: 
10 October 2013; published online: 13 
November 2013. 

Citation: Magnotti JF, Ma WJ and Beauchamp MS (2013) Causal inference
of asynchronous audiovisual speech. Front. Psychol. 4:798. doi:
10.3389/fpsyg.2013.00798

This article was submitted to Perception 
Science, a section of the journal Frontiers 
in Psychology. 

Copyright © 2013 Magnotti, Ma and 
Beauchamp. This is an open-access arti- 
cle distributed under the terms of the 
Creative Commons Attribution License 
(CC BY). The use, distribution or repro- 
duction in other forums is permitted, 
provided the original author(s) or licen- 
sor are credited and that the original 
publication in this journal is cited, in 
accordance with accepted academic prac- 
tice. No use, distribution or reproduction 
is permitted which does not comply with 
these terms. 






